CO2 Forest: Improved Random Forest by
Continuous Optimization of Oblique Splits

Mohammad Norouzi, Maxwell D. Collins, David J. Fleet, Pushmeet Kohli 


Abstract: We propose a novel algorithm for optimizing multivariate linear threshold functions as split functions of decision trees to
create improved Random Forest classifiers. Standard tree induction methods resort to sampling and exhaustive search to find good
univariate split functions. In contrast, our method computes a linear combination of the features at each node, and optimizes the
parameters of the linear combination (oblique) split functions by adopting a variant of latent variable SVM formulation. We develop a
convex-concave upper bound on the classification loss for a one-level decision tree, and optimize the bound by stochastic gradient
descent at each internal node of the tree. Forests of up to 1000 Continuously Optimized Oblique (CO2) decision trees are created,
which significantly outperform Random Forest with univariate splits and previous techniques for constructing oblique trees.
Experimental results are reported on multi-class classification benchmarks and on the Labeled Faces in the Wild (LFW) dataset.

Index Terms: decision trees, random forests, oblique splits, ramp loss
1 Introduction

Decision trees [6], [25] and random forests [13] have
a long, successful history in machine learning, in part due
to their computational efficiency and their applicability to
large-scale classification and regression tasks (e.g., see [7],
[11]). A case in point is the Microsoft Kinect, where multiple
decision trees are learned on millions of training exemplars
to enable real-time human pose estimation from depth images
[27]. The standard algorithm for decision tree induction
grows a tree one node at a time, greedily and recursively.
The building block of this procedure is an optimization at
each internal node of the tree, which divides the training
data at that node into two subsets according to a splitting
criterion, such as the Gini impurity index in CART [6], or
information gain in C4.5 [25]. This corresponds to optimizing a
binary decision stump, or a one-level decision tree, at each
internal node. Most tree-based methods exploit univariate
(axis-aligned) split functions, which compare one feature
dimension to a threshold. Optimizing univariate decision
stumps is straightforward because one can exhaustively
enumerate all plausible thresholds for each feature, and
thereby select the best parameters according to the split
criterion. However, univariate split functions have limited
discriminative power.
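The exhaustive enumeration described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the function names, the use of Gini impurity as the criterion, and the choice of midpoints between consecutive distinct feature values as candidate thresholds are all assumptions made for the example.

```python
import numpy as np

def gini(labels, num_classes):
    """Gini impurity of an integer label array (0 for an empty array)."""
    if labels.size == 0:
        return 0.0
    probs = np.bincount(labels, minlength=num_classes) / labels.size
    return 1.0 - np.sum(probs ** 2)

def best_univariate_split(X, y, num_classes):
    """Exhaustively search every feature and every midpoint threshold.

    Returns (feature index, threshold, weighted impurity) of the best split,
    where an example goes left if X[:, feature] <= threshold.
    """
    n, p = X.shape
    best = (None, None, np.inf)
    for f in range(p):
        values = np.unique(X[:, f])  # sorted distinct values of feature f
        # Candidate thresholds: midpoints between consecutive distinct values.
        for thr in (values[:-1] + values[1:]) / 2.0:
            left = X[:, f] <= thr
            score = (left.sum() * gini(y[left], num_classes)
                     + (~left).sum() * gini(y[~left], num_classes)) / n
            if score < best[2]:
                best = (f, thr, score)
    return best

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 5.0], [1.0, 4.0], [2.0, 9.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])
feature, threshold, impurity = best_univariate_split(X, y, num_classes=2)
```

The search costs one pass over the thresholds per feature, which is what makes the univariate case tractable; no comparable enumeration exists for general hyperplanes.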

We investigate the use of a more general and powerful
family of split functions, namely, linear-combination
(a.k.a. oblique) splits. Such split functions comprise a
multivariate linear projection of the features followed by
binary quantization. Clearly, exhaustive search over linear
hyperplanes is not feasible, and based on our preliminary
experiments, random sampling yields poor results. Further,
typical splitting criteria for a one-level decision tree
(decision stump) are discontinuous, since small changes in
split parameters may change the assignment of data to
branches of the tree. As a consequence, split parameters are
not readily amenable to numerical optimization, so oblique
split functions have not been used widely with tree-based
methods.

This paper advocates a new building block for learning
decision trees, i.e., an algorithm for continuous optimization
of oblique decision stumps. To this end, we introduce a
continuous upper bound on the empirical loss associated
with a decision stump. This upper bound resembles a ramp
loss, and accommodates any convex loss that is useful for
multi-class classification, regression, or structured
prediction [22]. As explained below, the bound is the difference of
two convex terms, the optimization of which is effectively
accomplished using the Convex-Concave Procedure of [33].
The proposed bound resembles the bound used for learning
binary hash functions [20].

Some previous work has also considered improving the
classification accuracy of decision trees by using oblique
split functions. For example, Murthy et al. [19] proposed a
method called OC1, which yields some performance gains
over CART and C4.5. Nevertheless, individual decision trees
are rarely sufficiently powerful for many classification and
regression tasks. Indeed, the power of tree-based methods
often arises from diversity among the trees within a forest.
Not surprisingly, a key question with optimized decision
trees concerns the loss of diversity that occurs with
optimization, and hence a reduction in the effectiveness of
forests of such trees. The random forest of Breiman seems
to achieve a good balance between optimization and
randomness.

Our experimental results suggest that one can effectively
optimize oblique split functions, and the loss of diversity
associated with such optimized decision trees can be
mitigated. In particular, it is found that when the decision stump
optimization is initialized with random forest's split
functions, one can indeed construct a forest of non-correlated
decision trees. We effectively take advantage of the underlying
non-convex optimization problem, for which a diverse
set of initial states for the optimizer yields a set of different
split functions. Like random forests, the resulting algorithm
achieves very good performance gains as the number of
trees in the ensemble increases.

We assess the effectiveness of our tree construction
algorithm by generating up to 1000 decision trees on nine
classification benchmarks. Our algorithm, called CO2 forest,
outperforms random forest on all of the datasets. It is also
shown to outperform a baseline of OC1 trees. As a
large-scale experiment, we consider the task of segmenting faces
from the Labeled Faces in the Wild (LFW) dataset [14].
Again, our results confirm that CO2 forest outperforms
other baselines.

2 Related Work 

Breiman et al. [6] proposed a version of CART that
employs linear combination splits, known as
CART-linear-combination (CART-LC). Murthy et al. [12], [19] proposed
OC1, a refinement of CART-LC that uses random restarts
and random perturbations to escape local minima. The main
idea behind both algorithms is to use coordinate descent to
optimize the parameters of the oblique splits one dimension
at a time. Keeping all of the weights corresponding to an
oblique decision stump fixed except one, for each datum
they compute the critical value of the missing weight at
which the datum switches its assignment to the branches.
Then, one can sort these critical values to find the optimal
value of each weight (with other weights fixed). By
performing multiple passes over the dimensions and the data,
oblique splits with small empirical loss can be found.
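The "critical value" step of this coordinate descent can be illustrated as follows. The sketch is not taken from the CART-LC or OC1 code; the function name and the toy data are hypothetical, and only the computation of the sorted critical values for one weight is shown, not the subsequent scan that scores each candidate.

```python
import numpy as np

def critical_values(w, X, d):
    """Values of w[d] at which each row of X lies exactly on the hyperplane.

    With every other weight fixed, w @ x = w[d] * x[d] + rest, so the sign
    of the projection flips at w[d] = -rest / x[d]. Rows with x[d] == 0
    never switch sides and are dropped.
    """
    rest = X @ w - w[d] * X[:, d]      # contribution of all other weights
    movable = X[:, d] != 0
    return np.sort(-rest[movable] / X[movable, d])

w = np.array([1.0, 1.0])
X = np.array([[1.0, 2.0], [2.0, -1.0], [0.0, 3.0]])
crit = critical_values(w, X, d=0)      # the third row has x[0] == 0
```

Between two consecutive critical values the assignment of data to branches is constant, so it suffices to evaluate the split criterion once per interval.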

By contrast, our algorithm updates all the weights
simultaneously using gradient descent. The aforementioned
algorithms focus mainly on optimizing the splitting
criterion to minimize tree size, and offer little promise of improved
generalization. Here, by adopting a formulation based on
the latent variable SVM [32], our algorithm provides a
natural means of regularizing the oblique split stumps, thereby
improving the generalization power of the trees.

The hierarchical mixture of experts (HME) [16] uses
soft splits rather than hard binary decisions to capture
situations where the transition from low to high response
is gradual. The empirical loss associated with HME is a
smooth function of the unknown parameters and hence
numerical optimization is feasible. The main drawback of
HME concerns inference. That is, multiple paths along the
tree should be explored during inference, which reduces the
efficiency of the classifier.

Our work builds upon random forest [5]. Random forest
combines bootstrap aggregating (bagging) [4] and the
random selection of features [13] to construct an ensemble of
non-correlated decision trees. The method is used widely for
classification and regression tasks, and research still
investigates its theoretical characteristics [8]. Building on random
forest, we also grow each tree using a bootstrapped version
of the training dataset. The main difference is the way
the split functions are selected. Training random forest is
generally faster than using our optimized oblique trees, and
because random forest uses univariate splits, classification
with the same number of trees is often faster. Nevertheless,
we often achieve similar accuracy with many fewer trees,
and depending on the application, the gain in classification
performance can be worth the computational overhead.

There also exist boosting-based techniques for creating
ensembles of decision trees [10], [34]. A key benefit of
random forest over boosting is that it allows for faster
training as the decision trees can be trained in parallel. In
our experiments we usually train 30 trees in parallel on a
multicore machine. Nevertheless, it is interesting to combine
boosting techniques with our oblique trees, and we leave
this to future work.

Menze et al. [18] also consider a variant of oblique
random forest. At each internal node they find an optimal
split function using either ridge regression or linear
discriminant analysis. Like other previous work [3], [30], the
technique of [18] is only conveniently applicable to binary
classification tasks. A big challenge in a multi-class setting
is solving the combinatorial assignment of labels to the two
leaves. In contrast to [18], our technique is more general,
and allows for optimization of multi-class classification and
regression loss functions.

Rota Bulo & Kontschieder [26] recently proposed the
use of multi-layer neural nets as split functions at internal
nodes. While extremely powerful, the resulting decision
trees lose their computational simplicity during training and
testing. Further, it may be difficult to produce the required
diversity among trees in a forest. This paper explores the
middle ground, with a simple, yet effective class of linear
multi-variate split functions. That said, note that the
formulation of the upper bound used to optimize empirical loss
in this paper can be extended to optimize other non-linear
split functions, including neural nets (e.g., [21]).

3 Preliminaries 

For ease of exposition, this paper is focused on binary
classification trees, with $m$ internal (split) nodes, and $m + 1$
leaf (terminal) nodes.$^1$ An input, $x \in \mathbb{R}^p$, is directed from
the root of the tree down through internal nodes to a leaf
node, which specifies a distribution over $k$ class labels.

Each internal node, indexed by $i \in \{1, \ldots, m\}$, performs
a binary test by evaluating a node-specific split function
$t_i(x) : \mathbb{R}^p \to \{-1, +1\}$. If $t_i(x)$ evaluates to $-1$, then $x$ is
directed to the left child of node $i$. Otherwise, $x$ is directed to
the right child, and so on down the tree. Each split function
$t_i(\cdot)$, parametrized by a weight vector $w_i$, is assumed to be a
linear threshold function of the form $t_i(x) = \mathrm{sgn}(w_i^\top x)$. We
incorporate an offset parameter to obtain split functions of
the form $\mathrm{sgn}(w_i^\top x - b_i)$ by using homogeneous coordinates
(i.e., by appending a constant "1" to the end of the input
feature vector).
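The routing rule above can be sketched as a short loop. The array-based tree encoding used here (child arrays `left` and `right`, with negative entries marking leaves) is a hypothetical layout chosen for the example, and treating a projection of exactly zero as $+1$ is likewise an assumption.

```python
import numpy as np

def route_to_leaf(x, weights, left, right):
    """Send input x from the root (node 0) to a leaf and return the leaf index.

    weights[i] is the (p + 1)-dimensional weight vector of internal node i.
    left[i] / right[i] hold the child of node i: a value c >= 0 is another
    internal node, a value c < 0 encodes leaf number -(c + 1).
    """
    x_h = np.append(x, 1.0)             # homogeneous coordinates absorb the offset
    node = 0
    while node >= 0:
        go_left = weights[node] @ x_h < 0   # split function evaluates to -1
        node = left[node] if go_left else right[node]
    return int(-(node + 1))

# A tree with m = 2 internal nodes and m + 1 = 3 leaves on 2-D inputs.
weights = np.array([[1.0, 0.0, -0.5],    # root: x[0] - 0.5
                    [1.0, 1.0, -2.0]])   # node 1: x[0] + x[1] - 2
left = np.array([-1, -2])                # root -> leaf 0, node 1 -> leaf 1
right = np.array([1, -3])                # root -> node 1, node 1 -> leaf 2
leaves = [route_to_leaf(np.array(v), weights, left, right)
          for v in ([0.0, 0.0], [1.0, 0.5], [1.5, 1.0])]
```

Only one root-to-leaf path is evaluated per input, each step costing a single dot product.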

Each leaf node, indexed by $j \in \{0, \ldots, m\}$, specifies
a conditional probability distribution over class labels,
$l \in \{1, \ldots, k\}$, denoted $p(y = l \mid j)$. These distributions are
parameterized in terms of a vector of unnormalized predictive
log-probabilities, denoted $\theta_j \in \mathbb{R}^k$, and a conventional
softmax function; i.e.,

$$ p(y = l \mid j) = \frac{\exp\{\theta_j[l]\}}{\sum_{\alpha=1}^{k} \exp\{\theta_j[\alpha]\}} , \tag{1} $$

where $v[\alpha]$ denotes the $\alpha$-th element of vector $v$.

1. In a binary tree the number of leaves is always one more than the
number of internal (non-leaf) nodes.
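As a small numerical illustration of Eq. (1), the following sketch converts a vector of unnormalized log-probabilities into a class distribution. The function name is hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that is not part of the equation itself.

```python
import numpy as np

def leaf_distribution(theta):
    """Softmax of Eq. (1): class probabilities from unnormalized log-probs."""
    shifted = theta - np.max(theta)   # subtracting the max avoids overflow
    expd = np.exp(shifted)
    return expd / expd.sum()

# Two classes whose unnormalized probabilities are in the ratio 1 : 3.
probs = leaf_distribution(np.array([0.0, np.log(3.0)]))
```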

The parameters of the tree comprise the $m$ internal
weight vectors, each of dimension $p + 1$, and the $m + 1$
vectors of unnormalized log-probabilities, one for each leaf
node, i.e., $\{w_i\}_{i=1}^{m}$ and $\{\theta_j\}_{j=0}^{m}$. Given a dataset of
input-output pairs, $\mathcal{D} = \{x_z, y_z\}_{z=1}^{n}$, where $y_z \in \{1, \ldots, k\}$ is the
ground truth class label associated with input $x_z \in \mathbb{R}^p$, we
wish to find a joint configuration of oblique splits $\{w_i\}_{i=1}^{m}$
and leaf parameters $\{\theta_j\}_{j=0}^{m}$ that minimize some measure of
misclassification loss on the training set. Joint optimization
of the split functions and leaf parameters according to a
global objective is, however, known to be extremely
challenging [15] due to the discrete and sequential nature of the
decisions within the tree.

To cope with the discontinuous objective caused by 
discrete split functions, we propose a smooth upper bound 
on the empirical loss, with which one can effectively learn 
a diverse collection of trees with oblique split functions. We 
apply this approach to the optimization of split functions 
of internal nodes within the context of a top-down greedy 
induction procedure, one in which each internal node is 
treated as an independent one-level decision stump. The 
split functions of the tree are optimized one node at a time, 
in a greedy fashion as one traverses the tree, breadth first, 
from the root downward. The procedure terminates when a 
desired tree depth is reached, or when some other stopping 
criterion is met. While we focus here on the optimization 
of a single stump, the formulation can be generalized to 
optimize entire trees. 


4 Continuous Optimization of Oblique (CO2) Decision Stumps

A binary decision stump is parameterized by a weight
vector $w$, and two vectors of unnormalized log-probabilities
for the two leaf nodes, $\theta_0$ and $\theta_1$. The stump's loss function
comprises two terms, one for each leaf, denoted $\ell(\theta_0, y)$ and
$\ell(\theta_1, y)$, where $\ell : \mathbb{R}^k \times \{1, \ldots, k\} \to \mathbb{R}^+$. They measure
the discrepancy between the label $y$ and the distributions
parameterized by $\theta_0$ and $\theta_1$. The binary test at the root of
the stump acts as a gating function to select a leaf, and hence
its associated loss. The empirical loss for the stump, i.e., the
sum of the loss over the training set $\mathcal{D}$, is defined as

$$ \mathcal{L}(w, \theta_0, \theta_1; \mathcal{D}) = \sum_{(x,y) \in \mathcal{D}} 1(w^\top x < 0)\, \ell(\theta_0, y) + 1(w^\top x \ge 0)\, \ell(\theta_1, y) , \tag{2} $$

where $1(\cdot)$ is the usual indicator function. Given the softmax
model of Eq. (1), the log loss takes the form

$$ \ell_{\log}(\theta, y) = -\theta[y] + \log \Big\{ \sum_{\alpha=1}^{k} \exp\{\theta[\alpha]\} \Big\} . \tag{3} $$

Regarding this formulation, we note that the parameters
which minimize Eq. (2) with the log loss, $\ell_{\log}$, are those
that maximize information gain. One can prove this with
straightforward algebraic manipulation of Eq. (2), recognizing
that the $\theta_0$ and $\theta_1$ that minimize Eq. (2), given any $w$,
are the empirical class log-probabilities at the leaves.
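The algebra behind this remark can be written out as follows. This is a sketch under a fixed split $w$, using notation that does not appear elsewhere in the paper: $\mathcal{D}_j$ is the set of training pairs routed to leaf $j$, $n_j = |\mathcal{D}_j|$, $n_{jl}$ is the number of those pairs with label $l$, $n = n_0 + n_1$, and $H(\cdot)$ is the empirical label entropy in nats (with the convention $0 \log 0 = 0$).

```latex
\begin{align*}
\sum_{(x,y)\in\mathcal{D}_j} \ell_{\log}(\theta_j, y)
  &= -\sum_{l=1}^{k} n_{jl}\,\theta_j[l]
     + n_j \log \sum_{\alpha=1}^{k} \exp\{\theta_j[\alpha]\} , \\
\frac{\partial}{\partial \theta_j[l]} (\cdot) = 0
  &\;\Longrightarrow\;
  \frac{\exp\{\theta_j[l]\}}{\sum_{\alpha=1}^{k} \exp\{\theta_j[\alpha]\}}
  = \frac{n_{jl}}{n_j} , \\
\min_{\theta_0,\theta_1} \mathcal{L}(w, \theta_0, \theta_1; \mathcal{D})
  &= -\sum_{j\in\{0,1\}} \sum_{l=1}^{k} n_{jl} \log \frac{n_{jl}}{n_j}
   = n_0\, H(\mathcal{D}_0) + n_1\, H(\mathcal{D}_1) , \\
\text{information gain}
  &= H(\mathcal{D}) - \frac{n_0}{n} H(\mathcal{D}_0) - \frac{n_1}{n} H(\mathcal{D}_1) .
\end{align*}
```

Because the log loss is convex in $\theta_j$, the stationary point in the second line is a minimizer (attained only in the limit when some $n_{jl} = 0$). Since $H(\mathcal{D})$ does not depend on $w$, minimizing the third line over $w$ is equivalent to maximizing the information gain in the fourth.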

We also note that the framework outlined below
accommodates other loss functions that are convex in $\theta$. For
instance, for regression tasks where $y$ is a real-valued vector, one can use
squared loss,

$$ \ell_{\mathrm{sqr}}(\theta, y) = \|\theta - y\|_2^2 . \tag{4} $$

As mentioned above, it is also important to note
that empirical loss, $\mathcal{L}(w, \theta_0, \theta_1; \mathcal{D})$, is a discontinuous
function of $w$. As a consequence, optimization of $\mathcal{L}$ with respect
to $w$ is very challenging. Our approach, outlined in detail
below, is to instead optimize a continuous upper bound on
empirical loss. This bound is closely related to formulations
of binary SVM and logistic regression classification. In the
case of binary classification, the assignment of class labels
to each side of the hyperplane, i.e., the parameters $\theta_0$ and
$\theta_1$, are pre-specified. In contrast, a decision stump with a
large number of labels entails joint optimization of both the
assignment of the labels to the leaves and the hyperplane
parameters.

4.1 Upper Bound on Empirical Loss 

The upper bound on loss that we employ, given an
input-output pair $(x, y)$, has the following form:

$$ 1(w^\top x < 0)\, \ell(\theta_0, y) + 1(w^\top x \ge 0)\, \ell(\theta_1, y) \;\le\; \max\big( -w^\top x + \ell(\theta_0, y) ,\; w^\top x + \ell(\theta_1, y) \big) - |w^\top x| , \tag{5} $$

where $|w^\top x|$ denotes the absolute value of $w^\top x$. To verify
the bound, first suppose that $w^\top x < 0$. In this case, it is
straightforward to show that the inequality reduces to

$$ \ell(\theta_0, y) \le \max\big( \ell(\theta_0, y) ,\; 2 w^\top x + \ell(\theta_1, y) \big) , \tag{6} $$

which holds trivially. Conversely, when $w^\top x \ge 0$ the
inequality reduces to

$$ \ell(\theta_1, y) \le \max\big( -2 w^\top x + \ell(\theta_0, y) ,\; \ell(\theta_1, y) \big) , \tag{7} $$

which is straightforward to validate. Hence the inequality
in Eq. (5) holds.
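The bound can also be checked numerically. The sketch below draws random projections and random non-negative leaf losses and compares both sides of the inequality elementwise; the variable names and the sampling distributions are arbitrary choices made for the illustration.

```python
import numpy as np

def stump_loss(margin, loss0, loss1):
    """Left-hand side of the bound: loss of the leaf selected by sign(margin)."""
    return np.where(margin < 0, loss0, loss1)

def upper_bound(margin, loss0, loss1):
    """Right-hand side of the bound, evaluated elementwise."""
    return np.maximum(-margin + loss0, margin + loss1) - np.abs(margin)

rng = np.random.default_rng(0)
margin = rng.normal(size=10_000)               # stands in for w^T x
loss0 = rng.uniform(0.0, 5.0, size=10_000)     # stands in for l(theta_0, y)
loss1 = rng.uniform(0.0, 5.0, size=10_000)     # stands in for l(theta_1, y)

# The gap is non-negative up to floating-point rounding.
gap = upper_bound(margin, loss0, loss1) - stump_loss(margin, loss0, loss1)
```

A passing check on random draws is of course not a proof; it only mirrors the two-case argument given in the text.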

Interestingly, while empirical loss in Eq. (2) is invariant
to $\|w\|$, the bound in Eq. (5) is not. That is, for any real
scalar $a > 0$, $\mathrm{sgn}(a w^\top x)$ does not change with $a$, and hence
$\mathcal{L}(w, \theta_0, \theta_1) = \mathcal{L}(a w, \theta_0, \theta_1)$. Thus, while the loss on the LHS
of Eq. (5) is scale-invariant, the upper bound on the RHS
of Eq. (5) does depend on $\|w\|$. Indeed, like the soft-margin
binary SVM formulation, and margin rescaling formulations
of structural SVM [31], the norm of $w$ affects the interplay
between the upper bound and empirical loss. In particular,
as the scale of $w$ increases, the upper bound becomes tighter
and its optimization becomes more similar to a direct loss
minimization.

More precisely, the upper bound becomes tighter as $\|w\|$
increases. This is evident from the following inequality,
which holds for any real scalar $a > 1$:

$$ \max\big( -w^\top x + \ell(\theta_0, y) ,\; w^\top x + \ell(\theta_1, y) \big) - |w^\top x| \;\ge\; \max\big( -a w^\top x + \ell(\theta_0, y) ,\; a w^\top x + \ell(\theta_1, y) \big) - a |w^\top x| . \tag{8} $$

To verify the bound, as above, consider the sign of $w^\top x$.
When $w^\top x < 0$, the inequality in Eq. (8) is equivalent to

$$ \max\big( \ell(\theta_0, y) ,\; 2 w^\top x + \ell(\theta_1, y) \big) \;\ge\; \max\big( \ell(\theta_0, y) ,\; 2 a w^\top x + \ell(\theta_1, y) \big) . $$

Conversely, when $w^\top x \ge 0$, Eq. (8) is equivalent to

$$ \max\big( -2 w^\top x + \ell(\theta_0, y) ,\; \ell(\theta_1, y) \big) \;\ge\; \max\big( -2 a w^\top x + \ell(\theta_0, y) ,\; \ell(\theta_1, y) \big) . $$

Thus, as $\|w\|$ increases the bound becomes tighter. In the
limit, as $\|w\|$ becomes large, the loss terms $\ell(\theta_0, y)$ and
$\ell(\theta_1, y)$ become negligible compared to the terms $-w^\top x$
and $w^\top x$, in which case the RHS of Eq. (5) equals its LHS,
except when $w^\top x \approx 0$. Hence, for large $\|w\|$, the bound
not only gets tight, but also becomes less smooth and more
difficult to optimize in our nonconvex setting.
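The effect of the scale of $w$ on the bound can be illustrated numerically. In this sketch the projections and leaf losses are random stand-ins, and the set of scale factors is an arbitrary choice; the bound is averaged over the sample for each scale.

```python
import numpy as np

def upper_bound(margin, loss0, loss1):
    """Per-example upper bound as a function of the projection w^T x."""
    return np.maximum(-margin + loss0, margin + loss1) - np.abs(margin)

rng = np.random.default_rng(1)
margin = rng.normal(size=1000)               # stands in for w^T x at scale a = 1
loss0 = rng.uniform(0.0, 5.0, size=1000)     # stands in for l(theta_0, y)
loss1 = rng.uniform(0.0, 5.0, size=1000)     # stands in for l(theta_1, y)

# Scaling w by a scales every projection by a; the leaf losses are unchanged.
scales = [1.0, 2.0, 4.0, 8.0]
mean_bounds = [upper_bound(a * margin, loss0, loss1).mean() for a in scales]
```

The averaged bound never increases as the scale grows, in line with the inequality above, while the kink around zero projection becomes sharper.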

From the derivation above, and through experiments
below, we observe that when $\|w\|$ is constrained,
optimization converges to better solutions that exhibit better
generalization. Summing over the bounds for the training
pairs, and restricting $\|w\|$, we obtain the surrogate objective
we aim to optimize to find the decision stump parameters:

$$ \text{minimize } \mathcal{L}'(w, \theta_0, \theta_1; \mathcal{D}, \nu) \quad \text{such that } \|w\|_2^2 \le \nu , \tag{9} $$

where $\nu \in \mathbb{R}^+$ is a regularization parameter, and $\mathcal{L}'$ is the
surrogate objective, i.e., the upper bound,

$$ \mathcal{L}'(w, \theta_0, \theta_1; \mathcal{D}, \nu) = \sum_{(x,y) \in \mathcal{D}} \max\big( -w^\top x + \ell(\theta_0, y) ,\; w^\top x + \ell(\theta_1, y) \big) - |w^\top x| . \tag{10} $$

For all values of $\nu$, we have that $\mathcal{L}'(w, \theta_0, \theta_1; \mathcal{D}, \nu) \ge
\mathcal{L}(w, \theta_0, \theta_1; \mathcal{D})$. We find a suitable $\nu$ via cross-validation.
Instead of using the typical Lagrange form for regularization,
we employ hard constraints with similar behavior.
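The surrogate objective and the norm constraint can be sketched as follows. The function names are hypothetical, class labels are 0-indexed in the code (the text uses $1, \ldots, k$), and the random data serve only to exercise the functions.

```python
import numpy as np

def log_loss(theta, y):
    """Log loss of Eq. (3), computed with a max shift for stability."""
    m = np.max(theta)
    return -theta[y] + m + np.log(np.sum(np.exp(theta - m)))

def per_example_losses(theta, labels):
    return np.array([log_loss(theta, label) for label in labels])

def empirical_loss(w, theta0, theta1, X, y):
    """Eq. (2): each example pays the loss of the leaf it is routed to."""
    margins = X @ w
    l0, l1 = per_example_losses(theta0, y), per_example_losses(theta1, y)
    return np.sum(np.where(margins < 0, l0, l1))

def surrogate_objective(w, theta0, theta1, X, y):
    """Eq. (10): sum of the per-example upper bounds."""
    margins = X @ w
    l0, l1 = per_example_losses(theta0, y), per_example_losses(theta1, y)
    return np.sum(np.maximum(-margins + l0, margins + l1) - np.abs(margins))

def project(w, nu):
    """Rescale w onto {w : ||w||_2^2 <= nu} if it lies outside that set."""
    sq_norm = w @ w
    return w if sq_norm <= nu else np.sqrt(nu) * w / np.sqrt(sq_norm)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 4, size=50)
theta0, theta1 = rng.normal(size=4), rng.normal(size=4)
w = project(rng.normal(size=3) * 10.0, nu=4.0)
surrogate = surrogate_objective(w, theta0, theta1, X, y)
empirical = empirical_loss(w, theta0, theta1, X, y)
```

For example, `project(np.array([3.0, 4.0]), nu=1.0)` rescales the vector to unit norm, giving `[0.6, 0.8]`.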

4.2 Convex-Concave Optimization 

Minimizing the surrogate objective in Eq. (10) entails
nonconvex optimization. While still challenging, it is important
that $\mathcal{L}'(w, \theta_0, \theta_1; \mathcal{D}, \nu)$ is better behaved than empirical loss.
It is piecewise smooth and convex-concave in $w$, and the
constraint on $w$ defines a convex set. As a consequence,
gradient-based optimization is applicable, although the
surrogate objective is non-differentiable at isolated points. The
objective also depends on the leaf parameters, $\theta_0$ and $\theta_1$,
but only through the loss terms $\ell$, which we constrained
to be convex in $\theta$. Therefore, for a fixed $w$, it follows that
$\mathcal{L}'(w, \theta_0, \theta_1; \mathcal{D}, \nu)$ is convex in $\theta_0$ and $\theta_1$.

The convex-concave nature of the surrogate objective
allows us to use difference of convex (DC) programming,
or the Convex-Concave Procedure (CCCP) [33], a method
for minimizing objective functions expressed as the sum of a
convex and a concave term. The CCCP has been employed
by Felzenszwalb et al. [9] and Yu & Joachims [32] to
optimize latent variable SVM models that employ a similar
convex-concave surrogate objective.

Algorithm 1 The convex-concave procedure for Continuous
Optimization of Oblique (CO2) decision stumps that minimizes
Eq. (9) to estimate $(w, \theta_0, \theta_1)$ given a training dataset
$\mathcal{D}$, and a hyper-parameter $\nu$ that constrains the norm of $w$

 1: Initialize $w$ by a random univariate split
 2: Estimate $\theta_0$ and $\theta_1$ based on $w$ and $\mathcal{D}$
 3: while surrogate objective has not converged do
 4:   $w^{\mathrm{old}} \leftarrow w$
 5:   for $t = 1$ to $\tau$ do
 6:     sample a pair $(x, y)$ at random from $\mathcal{D}$
 7:     $s \leftarrow \mathrm{sgn}((w^{\mathrm{old}})^\top x)$
 8:     if $-w^\top x + \ell(\theta_0, y) > w^\top x + \ell(\theta_1, y)$ then
 9:       $w \leftarrow w + \eta (1 + s) x$
10:       $\theta_0 \leftarrow \theta_0 - \eta\, \partial \ell(\theta_0, y) / \partial \theta$
11:     else
12:       $w \leftarrow w - \eta (1 - s) x$
13:       $\theta_1 \leftarrow \theta_1 - \eta\, \partial \ell(\theta_1, y) / \partial \theta$
14:     end if
15:     if $\|w\|_2^2 > \nu$ then
16:       $w \leftarrow \sqrt{\nu} \cdot w / \|w\|_2$
17:     end if
18:   end for
19: end while

The Convex-Concave Procedure is an iterative method.
At each iteration the concave term ($-|w^\top x|$ in our case)
is replaced with its tangent plane at the current parameter
estimate, to formulate a convex subproblem. The
parameters are updated with those that minimize the convex
subproblem, and then the tangent plane is updated. Let
$w^{\mathrm{old}}$ denote the estimate for $w$ from the previous CCCP
iteration. In the next iteration, $w$, $\theta_0$, and $\theta_1$ are updated
by minimizing

$$ \sum_{(x,y) \in \mathcal{D}} \Big( \max\big( -w^\top x + \ell(\theta_0, y) ,\; w^\top x + \ell(\theta_1, y) \big) - \mathrm{sgn}\big((w^{\mathrm{old}})^\top x\big)\, w^\top x \Big) , \tag{11} $$

such that $\|w\|_2^2 \le \nu$.

Note that $w^{\mathrm{old}}$ is constant during optimization of this
CCCP subproblem. In that case, the second term within the
sum over training data in Eq. (11) just defines a hyperplane
in the space of $w$. The other (first) term within the sum
entails maximization of a function that is convex in $w$, $\theta_0$
and $\theta_1$, since the maximum of two convex functions is
convex. As a consequence, the objective of Eq. (11) is convex.

We use stochastic subgradient descent to minimize
Eq. (11). After each subgradient update, $w$ is projected back
into the feasible region. For efficiency, we do not wait for
complete convergence of the convex subproblem within
CCCP. Instead, $w^{\mathrm{old}}$ is updated after a fixed number of
epochs (denoted $\tau$) over the training dataset. The
pseudocode for the optimization procedure is outlined in Alg. 1.
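A minimal NumPy sketch of the procedure, assuming the log loss, is given below. It is not the authors' implementation: the function names are hypothetical, labels are 0-indexed, the leaf parameters are initialized from smoothed label counts, and a fixed number of outer iterations (`outer`) stands in for the convergence test on the surrogate objective. Inputs are assumed to be already in homogeneous coordinates.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def log_loss(theta, y):
    """Log loss of Eq. (3) for a 0-indexed label y."""
    return -np.log(softmax(theta)[y])

def log_loss_grad(theta, y):
    """Gradient of the log loss w.r.t. theta: softmax(theta) - onehot(y)."""
    g = softmax(theta)
    g[y] -= 1.0
    return g

def leaf_log_probs(labels, num_classes, smoothing=1.0):
    """Smoothed empirical class log-probabilities (smoothing is an assumption)."""
    counts = np.bincount(labels, minlength=num_classes) + smoothing
    return np.log(counts / counts.sum())

def co2_stump(X, y, num_classes, w_init, nu, eta=0.01, tau=1000, outer=20, seed=0):
    rng = np.random.default_rng(seed)
    w = w_init.astype(float).copy()
    side = X @ w >= 0
    theta0 = leaf_log_probs(y[~side], num_classes)   # left leaf
    theta1 = leaf_log_probs(y[side], num_classes)    # right leaf
    for _ in range(outer):            # fixed count replaces the convergence test
        w_old = w.copy()              # linearization point of the concave term
        for _ in range(tau):
            i = rng.integers(len(y))
            x, label = X[i], y[i]
            s = 1.0 if w_old @ x >= 0 else -1.0
            if -w @ x + log_loss(theta0, label) > w @ x + log_loss(theta1, label):
                w = w + eta * (1.0 + s) * x
                theta0 = theta0 - eta * log_loss_grad(theta0, label)
            else:
                w = w - eta * (1.0 - s) * x
                theta1 = theta1 - eta * log_loss_grad(theta1, label)
            sq_norm = w @ w
            if sq_norm > nu:          # project back onto the feasible set
                w = np.sqrt(nu) * w / np.sqrt(sq_norm)
    return w, theta0, theta1

# Toy problem: two classes separated by the diagonal x[0] + x[1] = 0.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w0 = np.array([1.0, 0.0, 0.0])        # a univariate split on feature 0
w, theta0, theta1 = co2_stump(X, y, 2, w0, nu=10.0, tau=200, outer=5)
```

The sketch uses single-example updates with a constant learning rate, exactly as in the pseudocode; it omits mini-batching, momentum, and learning-rate decay.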

In practice, we implement Alg. 1 with several small
modifications. Instead of estimating the gradients based on
a single data point, we use mini-batches of 100 elements,
and average their gradients. We also use a momentum term
of 0.9 to converge more quickly. Finally, although a constant
learning rate $\eta$ is used in Alg. 1, we instead track the value
of the surrogate objective, and when it oscillates for more
than a number of iterations we reduce the learning rate.
5 Implementation and Experimental Details

In all tree construction methods considered here, we grow
each decision tree as deep as possible, until we reach a pure
leaf. We exploit the bagging ensemble learning algorithm [4]
to create the forest such that each tree is built by using
a new data set sampled uniformly with replacement from
the original dataset. In Random Forest, for finding each
univariate split function we only consider a candidate set
of size $q$ of random feature dimensions, where $q$ is the
only hyper-parameter in our random forest implementation.
We set the parameter $q$ by growing a forest of 1000 trees
and testing them on a hold-out validation set of size 20%
of the training set. Let $p$ denote the dimensionality of the
feature descriptors. We choose $q$ from the candidate set
of $\{p^{0.5}, p^{0.6}, p^{0.7}, p^{0.8}, p^{0.9}\}$ to accelerate validation. Some
previous work suggests the use of $q = \sqrt{p}$ as a heuristic,
which is included in the candidate set.
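As a small illustration, the candidate set for $q$ can be enumerated as below. The function name is hypothetical, and rounding $p^{e}$ to the nearest integer and clipping to $[1, p]$ are assumptions, since the text does not say how fractional values are handled.

```python
def candidate_subset_sizes(p, exponents=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Candidate values of q = p ** e, rounded to integers and clipped to [1, p]."""
    return [min(p, max(1, round(p ** e))) for e in exponents]

# MNIST has p = 784 feature dimensions, so the sqrt(p) heuristic gives q = 28.
sizes = candidate_subset_sizes(784)
```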

We use an OC1 implementation provided by the
authors [1]. We slightly modified the code to allow trees to
grow to their fullest extent, removing hard-coded limits on
tree depth and minimum examples for computing splits.
We also modified the initialization of OC1 optimization to
match our initialization for CO2, whereby an optimal
axis-aligned split on a subsampling of $q$ possible features is
used. Interestingly, we observed that both changes improve
OC1's performance when building ensembles of multiple
trees, OC1 Forest. We use the default values provided by the
authors for OC1 hyperparameters.

CO2 Forest has three hyper-parameters, namely, the
regularization parameter $\nu$, the initial learning rate $\eta$, and $q$,
the size of the feature candidate set from which the best is selected
to initialize the CO2 optimization. Ideally, one may consider
using different regularizer parameters for different internal
nodes of the tree, since the number of available training data
decreases as one descends the tree. However, we use the
same regularizer and learning rate for all of the nodes to
keep hyper-parameter tuning simple. We set $q$ as selected
by the random forest validation above. We perform a grid
search over $\nu$ and $\eta$ to select the best hyper-parameters.

6 Experiments 

Before presenting the classification results, we investigate
the impact of the hyper-parameter $\nu$ on our oblique decision
trees. Fig. 1 depicts training and validation error rates for the
MNIST dataset for different values of $\nu \in \{0.1, 1, 10, 100\}$
and different tree depths. One can see that as the tree
depth increases, training error rate decreases monotonically.
However, validation error rate saturates at a certain depth,
e.g., a depth of 10 for MNIST. Growing the trees deeper
beyond this point either has no impact, or slightly hurts
the performance. From the plots it appears that $\nu = 10$
exhibits the best training and validation error rates. The
difference between different values of $\nu$ seems to be larger
for validation error.

As shown above in Eq. (8), as $\nu$ increases the upper
bound becomes tighter. Thus, one might suspect that larger
$\nu$ implies a better optimum and better training error rates.
However, increasing $\nu$ not only tightens the bound, but also
makes the objective less smooth and harder to optimize. For
MNIST, at $\nu \approx 10$ there appears to be a reasonable balance
between the tightness of the bound and the smoothness of
the objective. The hyper-parameter $\nu$ also acts as a
regularizer, contributing to the large gap in the validation error
rates. For completeness, we also include baseline results
with univariate decision trees. Clearly, the CO2 trees reach
the same training error rates as the baseline but at a smaller
depth. As seen from the validation error rates, the CO2 trees
achieve better generalization too.

Classification results for tree ensembles are generally
much better than a single tree. Here, we compare our
Continuously Optimized Oblique (CO2) decision forest with random
forest [5] and OC1 forest, a forest built using OC1 [19]. Results
for random forest are obtained with the implementation of
the scikit-learn package [23]. Both of the baselines use
information gain as the splitting criterion for learning decision
stumps. We do not directly compare with other types of
classifiers, as our research concerns tree-based techniques.
Nevertheless, the reported results are often competitive with
the state-of-the-art.

6.1 UCI multi-class benchmarks 

We conduct experiments on nine UCI multi-class
benchmarks, namely, SatImage, USPS, Pendigits, Letter, Protein,
Connect4, MNIST, SensIT, Covertype. Table 1 provides a
summary of the datasets, including the number of training
and test points, the number of class labels, and the feature
dimensionality. We use the training and test splits set by
previous work, except for Connect4 and Covertype. More
details about the datasets, including references to the
corresponding publications, can be found at the LIBSVM dataset
repository page [2].

Test error rates for random forest, OC1 Forest, and CO2
Forest with different numbers of trees (10, 30, 1000) are
reported in Table 1. OC1 results are not presented on some
datasets, as the derivative-free coordinate descent method
used does not scale to large or high-dimensional datasets,
e.g., requiring more than 24 hours to train a single tree
on MNIST. CO2 Forest consistently outperforms random
forest and OC1 Forest on all of the datasets. In some cases,
i.e., Covertype and SatImage, the improvement is small,
but in four of the datasets CO2 Forest with only 10 trees
outperforms random forest with 1000 trees.

For all methods, there is a large performance gain when
the number of trees is increased from 10 to 30. The marginal
gain from 30 to 1000 trees is less significant, but still notable.
Finally, we also plot test error curves as a function of log
number of trees in Fig. 2. CO2 Forest outperforms random
forest and OC1 by a large margin and in most cases the
marginal gain persists across different numbers of trees. For
some datasets, OC1 Forest outperforms random forest, but
it consistently underperforms CO2 Forest.

For pre-processing, the datasets are scaled so that either
the feature dimensions are in the range of [0, 1], or they have
zero mean and unit standard deviation. For CO2 Forest,
we select $\nu$ from the set $\{0.1, 1, 4, 10, 43, 100\}$, and $\eta$ from
the set $\{0.03, 0.01, 0.003\}$. A validation of 30 decision trees is
performed over 18 entries of the grid of $(\nu, \eta)$.
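The two scaling options and the hyper-parameter grid can be sketched as follows. The function names are hypothetical, and computing the scaling statistics on the training split only is an assumption for this illustration.

```python
import itertools
import numpy as np

def scale_to_unit_range(X_train, X_test):
    """Map each feature to [0, 1] using the training minimum and range."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)    # constant features map to 0
    return (X_train - lo) / span, (X_test - lo) / span

def standardize(X_train, X_test):
    """Give each feature zero mean and unit standard deviation on the training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)       # avoid dividing by zero
    return (X_train - mean) / std, (X_test - mean) / std

# The grid of (nu, eta) pairs: 6 x 3 = 18 entries.
nus = [0.1, 1, 4, 10, 43, 100]
etas = [0.03, 0.01, 0.003]
grid = list(itertools.product(nus, etas))

X_train = np.array([[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]])
X_test = np.array([[1.0, 10.0]])
train_scaled, test_scaled = scale_to_unit_range(X_train, X_test)
```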

6.2 Labeled Faces in the Wild (LFW) 

As a large-scale experiment, we consider the task of seg¬ 
menting face parts based on the Labeled Faces in the Wild 
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Fig. 1. The impact of hyper-parameter v on MNIST training and validation error rates for C02 decision trees. The dashed baseline represents 
univariate decision trees with no pruning. 


Dataset Information 

Test error (%) with different number of trees 

Ran 

dom Forest 

30 1000 

OC1 Forest 

10 30 1000 

C02 Forest 

10 30 1000 

Name 

#Train 

#Test 

#Class 

Dim 

10 

Satlmage 

4,435 

2,000 

6 

36 

10.1 

9.4 

8.9 

10.1 

9.9 

9.5 

9.6 

9.1 

8.9 

USPS 

7,291 

2,007 

10 

256 

9.0 

7.2 

6.4 

7.1 

7.1 

6.8 

5.8 

5.9 

5.5 

Pendigits 

7,494 

3,498 

10 

16 

3.9 

3.3 

3.5 

3.2 

2.2 

2.3 

1.8 

1.7 

1.7 

Letter 

15,000 

5,000 

26 

16 

6.6 

4.7 

3.7 

7.6 

5.0 

3.8 

3.2 

2.3 

1.8 

Protein 

17, 766 

6,621 

3 

357 

39.9 

35.5 

30.9 

39.1 

34.6 

30.8 

33.8 

31.2 

30.3 

Connect4* 

55,000 

12,557 

3 

126 

18.9 

17.4 

16.2 


N/A 


17.1 

15.7 

14.7 

MNIST 

60,000 

10,000 

10 

784 

4.5 

3.5 

2.8 


N/A 


2.5 

2.0 

1.9 

SensIT 

78,823 

19,705 

3 

100 

15.5 

14.0 

13.4 


N/A 


14.1 

13.0 

12.5 

Covertype* 

500,000 

81,012 

7 

54 

3.2 

2.8 

2.6 


N/A 


3.1 

2.7 

2.6 


TABLE 1 

Test error rates for forests with different number of trees on multi-class classification benchmarks. First few columns provide dataset information. 
Test error rates (%) for random forest, OC1 Forest, and C02 Forest with 10, 30, and 1000 trees are reported. For datasets marked with a star “*” 
(i.e., Connect4 & Covertype) we use our own training and test splits. As the number of training data points and feature dimensionality increase, 
OC1 becomes prohibitively slow, so this method is not applicable to the datasets with high-dimensional data or large training sets. 


(LFW) dataset fM]|. We seek to label image pixels that belong 
to one of the following 7 face parts: lower face, nose, mouth, 
and the left and right eyes and eyebrows. These parts should 
be differentiated from the background, which provides a 
total of 8 class labels. To address this task, decision trees are 
trained on 31 × 31 image sub-windows to predict the label 
of the center pixel. Each 31 × 31 window with three RGB 
channels is vectorized to create an input in $\mathbb{R}^{2883}$. We ignore 
part labels for a 15-pixel border around each image at both 
training and test time. 
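
As a minimal sketch of the input construction described above, the following flattens a 31 × 31 RGB window into a 2883-dimensional vector. The function names, the nested-list image layout, and the reading of the 15-pixel border as exactly the set of centers whose window would leave the image are illustrative assumptions, not details taken from our implementation.

```python
def valid_centers(height, width, border=15):
    """Yield (row, col) centers that lie outside the ignored border.

    Assumption: with border == 15, these are exactly the centers whose
    31 x 31 window fits entirely inside the image.
    """
    for row in range(border, height - border):
        for col in range(border, width - border):
            yield row, col


def extract_window(image, row, col, half=15):
    """Flatten the (2*half+1) x (2*half+1) RGB window centered at (row, col).

    `image` is a nested list indexed as image[row][col] -> (r, g, b).
    The result has (2*half+1)**2 * 3 entries, i.e. 2883 for half == 15.
    """
    features = []
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            features.extend(image[r][c])
    return features
```

For a 40 × 40 image this gives 10 × 10 = 100 valid centers, each mapped to a vector of length 31 · 31 · 3 = 2883.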

To train each tree, we subsample 256,000 sub-windows 
from training images. We then normalize the pixels of each 
window to be of unit norm and variance across the training 
set. The same transformation is applied to input windows 
at test time using the normalization parameters calculated 
on the training set. To correct for the class label imbalance, 
as in [17], we subsample training windows so that each label 
has an equal number of training examples. At test time, we 
reweight the class label probabilities by the inverse of 
the factor by which each label was undersampled or 
oversampled during training. 
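
The test-time correction can be sketched as follows. The function name and the final renormalization step are assumptions made for illustration; the text above only specifies that each label's probability is scaled by the inverse of its sampling factor.

```python
def reweight_probabilities(probs, sampling_factors):
    """Undo class rebalancing on one predicted label distribution.

    probs[k] is the predicted probability of label k.
    sampling_factors[k] is the (assumed known, positive) ratio of the number
    of training windows kept for label k to the number originally available:
    below 1 for undersampled labels, above 1 for oversampled ones.
    """
    # Scale each probability by the inverse of its sampling factor.
    weighted = [p / f for p, f in zip(probs, sampling_factors)]
    # Renormalize so the result is again a distribution (an assumption here).
    total = sum(weighted)
    return [w / total for w in weighted]
```

For example, if a label was undersampled to one tenth of its original frequency, its predicted probability is multiplied by 10 before renormalization.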

Other than the random forest baseline, we also train 
decision trees and forests using split functions that compare 
two features (or "probes"), where the choice of features 
comes from finding the optimal pair of features out of 
a large number of sampled pairs. This method produces 
decision forests analogous to [27]. We call this baseline 
Two-probe Forest. The same technique can be used to generate 
split functions with several features, but we found that 
using only two features produces the best accuracy on the 
validation set. 

Because of the class label imbalance in LFW, classification 
accuracy is a poor measure of the segmentation quality. 
A more informative performance measure, also used in the 
PASCAL VOC challenge, is the class-average Jaccard score. 
We report Jaccard scores for the baselines vs. CO2 Forest 
in Table 2. It is clear that Two-probe Forest outperforms 
random forest, and CO2 Forest outperforms both of the 
baselines considerably. The superiority of CO2 Forest is 
consistent in Fig. 4, where Jaccard scores are depicted for forests 
with fixed tree depths, and forests with different numbers of 
trees. The Jaccard score is calculated for each class label, 
against all of the other classes, as $100 \cdot tp / (tp + fp + fn)$, 
where $tp$, $fp$, and $fn$ count true positives, false positives, 
and false negatives for that class. 
The average of this quantity over classes is reported here. 
The test set comprises 250 randomly chosen images. 
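
The class-average Jaccard score defined above can be computed as in the sketch below. The function name and the flat-sequence label format are illustrative assumptions. The sketch also pools all pixels it is given and skips classes that appear in neither the ground truth nor the predictions, neither of which is specified above.

```python
def class_average_jaccard(true_labels, pred_labels, num_classes):
    """Mean over classes of 100 * tp / (tp + fp + fn), one-vs-rest per class.

    true_labels and pred_labels are equal-length sequences of integer
    class labels in range(num_classes), one entry per pixel.
    """
    scores = []
    for k in range(num_classes):
        tp = fp = fn = 0
        for t, p in zip(true_labels, pred_labels):
            if p == k and t == k:
                tp += 1      # predicted k, truly k
            elif p == k:
                fp += 1      # predicted k, truly another class
            elif t == k:
                fn += 1      # truly k, predicted another class
        denom = tp + fp + fn
        if denom > 0:        # skip classes absent from both (assumption)
            scores.append(100.0 * tp / denom)
    return sum(scores) / len(scores) if scores else 0.0
```

Unlike plain accuracy, this score gives every class equal weight, so a classifier that labels everything as background scores poorly even when background dominates the pixel count.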

[Figure 2 plots; surviving panel titles: Satimage, USPS, Pendigits]

Fig. 2. Test error curves for Random Forest and OC1 Forest vs. CO2 Forest as a function of (log) number of trees on the multi-class classification 
benchmarks. On the last four datasets, the OC1 implementation is prohibitively slow, hence not applicable. 


Technique           16 trees   32 trees
Random Forest          32.28      34.61
Two-probe Forest       36.03      38.61
CO2 Forest             40.33      42.55


TABLE 2 

Test Jaccard scores comparing CO2 Forest to baseline forests on the 
Labeled Faces in the Wild (LFW) dataset. 


We use the Jaccard score to select the CO2 
hyperparameters ν and η. We perform grid search over 
$\eta \in \{10^{-5}, 10^{-4}, 3 \cdot 10^{-4}, 6 \cdot 10^{-4}, 0.001, 0.003\}$ and 
$\nu \in \{0.1, 1, 4, 10, 43, 100\}$. We compare the scores for 16 trees 
on a held-out validation set of 100 images. The choice of 
$\eta = 10^{-4}$ and $\nu = 1$ achieves the highest validation Jaccard 
score of 41.87, and these values are used in the final experiments. 
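
The grid search over the two hyperparameters described above can be sketched as follows. The names `grid_search`, `ETAS`, `NUS`, and the callback `validation_score` are hypothetical; the callback stands in for training a 16-tree forest with the given pair of values and returning its Jaccard score on the held-out validation images.

```python
from itertools import product

# Grid values as listed in the text (names are illustrative).
ETAS = [1e-5, 1e-4, 3e-4, 6e-4, 1e-3, 3e-3]
NUS = [0.1, 1, 4, 10, 43, 100]


def grid_search(validation_score, etas=ETAS, nus=NUS):
    """Return (best_score, eta, nu) maximizing validation_score(eta, nu).

    validation_score is a hypothetical user-supplied callable; ties are
    broken in favor of the first pair encountered.
    """
    best = None
    for eta, nu in product(etas, nus):
        score = validation_score(eta, nu)
        if best is None or score > best[0]:
            best = (score, eta, nu)
    return best
```

With six values per hyperparameter, this evaluates 36 configurations, each requiring one forest to be trained.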

We note that some other tree-like structures [28] and 
more sophisticated Computer Vision systems built for face 
segmentation [29] achieve better segmentation accuracy on 
LFW. However, our models use only raw pixel values, and 
our goal was to compare CO2 Forest against forest baselines. 

7 Conclusion 

We present Continuously Optimized Oblique (CO2) Forest, 
a new variant of random forest that uses oblique split 
functions. Even though the information gain criterion used 
for inducing decision trees is discontinuous and hard to 
optimize, we propose a continuous upper bound on the 
information gain objective. We leverage this bound to 
optimize oblique decision tree ensembles, which achieve a 
large improvement on classification benchmarks over a 
random forest baseline and previous methods of constructing 
oblique decision trees. In contrast to OC1 trees, our method 
scales to problems with high-dimensional inputs and large 
training sets, which are commonplace in Computer Vision 
and Machine Learning. Our framework is straightforward 
to generalize to other tasks, such as regression or structured 
prediction, as the upper bound is general and applies to any 
form of convex loss function. 


Fig. 4. Test Jaccard scores on LFW for (left) forests of 16 trees with 
tree depth constraints from 10 to 30 and (right) forests with different 
numbers of trees from 1 to 32. 


[Image grid; column headers: Input, Ground truth, Axis-aligned, Two-probe, CO2]

Fig. 3. Side-by-side comparisons of classification results on images from the test set. From left to right, these are the input image, 
the ground truth labels, and the outputs of Random Forest, Two-probe Forest, and CO2 Forest. The first column depicts images chosen to show a 
spread of different segmentation qualities: the first three are the median image by Jaccard score for each method, and the last two are the examples with the 
highest mean and the highest variance in scores across methods. The second column shows five random examples. 